Output eval logging (batch level) #2977
…r into error_logging_callback
Add pytorch nightly and CUDA 12.1 support for composer docker images What issue(s) does this change relate to? Related to https://mosaicml.atlassian.net/browse/GRT-2305 Tests docker image: mosaicml/ci-staging:72744756-794c-4390-94db-72c212dd5e00 (cuda 12.1, pytorch 2.1.0) mcli connect temp-test-ZAVxMh Python 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch >>> print(torch.version) <module 'torch.version' from '/usr/lib/python3/dist-packages/torch/version.py'> >>> print(torch.__version__) 2.1.0.dev20230623+cu121 >>> print(torch.version.cuda) 12.1 Integration Test @mvpatel2000 has validated that this trains on initial mpt-2 experiments and speeds up training by +7-8% from 0.25 MFU to 0.27 MFU
* fix autoresume with slashed directory * Revert "fix autoresume with slashed directory" This reverts commit 3dfb5f5. revert * fix * fix precommit * Update in_context_learning_evaluation.py * Update in_context_learning_evaluation.py * Update in_context_learning_evaluation.py * add tests
Signed-off-by: Prithvi Kannan <[email protected]> Co-authored-by: Evan Racah <[email protected]> Co-authored-by: eracah <[email protected]>
Upstreams and generalizes the callback that logs generations to wandb from foundry to composer.
…2476) Upgrade torch docker nightly version to 08-23-23 so that we get nccl version 0.18.3 which was merged on 08-18-23.
Test pytorch 2.1.0 docker images on ci/cd mosaicml#2469
Will be removed in v0.18.
* Update RTD build config with build.os * Remove python.version --------- Co-authored-by: Bandish Shah <[email protected]>
# What does this PR do? Security vulnerability in `semver` seen due to node. This PR upgrades the node version to bump up semver from 7.5.1 to 7.5.2 # Tests Action Run: https://github.com/mosaicml/composer/actions/runs/6017539089 Correct version of semver seen after upgrade: ``` mosaicml#14 [pytorch_stage 7/24] RUN npm list -g semver --depth=1 mosaicml#14 2.223 /usr/lib mosaicml#14 2.223 `-- [email protected] mosaicml#14 2.223 `-- [email protected] mosaicml#14 2.223 mosaicml#14 DONE 2.4s ```
* Gating tying modules w/ FSDP * Changing weight tying filtering to be less aggressive * precommit formatting
* Removing min_params * formatting? * removing overlap with another commit
* add fix * fix tests * qwf * dsfg * add key * remove short * add map test * remove comment * filter warning * simplify wrapping * checkdown * fix torchmetrics * 300 * fix tests * remove metric * cleanup * bug fixes * fix lint * fix lint * fix test * lint * remove cuda * fix tests * fix ignore * fix loading * fix test * save ckpt --------- Co-authored-by: Mihir Patel <[email protected]> Co-authored-by: Daniel King <[email protected]> Co-authored-by: Your Name <[email protected]>
* Adding some fixes to FSDP tests * Add filter warnings
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>
LGTM. I mostly looked closely at the callbacks and loggers part
Also I just want to say great work! This is a herculean PR requiring deep, bespoke knowledge while juggling several different parts of the composer codebase. Not an easy one to wrangle and seems like you managed to make it work!
I'm holding until a few things are resolved, but helping you land this is my top priority :)
…some/composer into error_logging_callback_in_batch
* prelim commit * fix max answer lengths for cot * add output logger * create eval output logger * fix pyright; git push * change dist reduce fx * change dist reduce fx * fix pyright * Add nightly docker image (#2452) Add pytorch nightly and CUDA 12.1 support for composer docker images What issue(s) does this change relate to? Related to https://mosaicml.atlassian.net/browse/GRT-2305 Tests docker image: mosaicml/ci-staging:72744756-794c-4390-94db-72c212dd5e00 (cuda 12.1, pytorch 2.1.0) mcli connect temp-test-ZAVxMh Python 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch >>> print(torch.version) <module 'torch.version' from '/usr/lib/python3/dist-packages/torch/version.py'> >>> print(torch.__version__) 2.1.0.dev20230623+cu121 >>> print(torch.version.cuda) 12.1 Integration Test @mvpatel2000 has validated that this trains on initial mpt-2 experiments and speeds up training by +7-8% from 0.25 MFU to 0.27 MFU * Fix local eval (#2465) * fix autoresume with slashed directory * Revert "fix autoresume with slashed directory" This reverts commit 3dfb5f5. revert * fix * fix precommit * Update in_context_learning_evaluation.py * Update in_context_learning_evaluation.py * Update in_context_learning_evaluation.py * add tests * Add torch 2.1.0 args for github release-docker workflow * Log system metrics on each event (#2412) Signed-off-by: Prithvi Kannan <[email protected]> Co-authored-by: Evan Racah <[email protected]> Co-authored-by: eracah <[email protected]> * Fix torch 2.1.0 docker tag (#2472) * Upstream Generate Callback (#2449) Upstreams and generalizes the callback that logs generations to wandb from foundry to composer. * Upgrade torch nightly docker image for 0.18.3 NCCL version (#2476) Upgrade torch docker nightly version to 08-23-23 so that we get nccl version 0.18.3 which was merged on 08-18-23. 
* Test pytorch 2.1.0 docker images on ci/cd (#2469) Test pytorch 2.1.0 docker images on ci/cd #2469 * Fix huggingface tokenizer loading for slow tokenizers (#2483) * Deprecate Fused LayerNorm (#2475) Will be removed in v0.18. * Transformers upgrade (#2489) * Update RTD build config with build.os (#2490) * Update RTD build config with build.os * Remove python.version --------- Co-authored-by: Bandish Shah <[email protected]> * Upgrade torch docker version and github workflow tests (#2488) * upgrade node version (#2492) # What does this PR do? Security vulnerability in `semver` seen due to node. This PR upgrades the node version to bump up semver from 7.5.1 to 7.5.2 # Tests Action Run: https://github.com/mosaicml/composer/actions/runs/6017539089 Correct version of semver seen after upgrade: ``` #14 [pytorch_stage 7/24] RUN npm list -g semver --depth=1 #14 2.223 /usr/lib #14 2.223 `-- [email protected] #14 2.223 `-- [email protected] #14 2.223 #14 DONE 2.4s ``` * Gating tying modules w/ FSDP for torch 2.0 (#2467) * Gating tying modules w/ FSDP * Changing weight tying filtering to be less aggressive * precommit formatting * Removing min_params (#2494) * Removing min_params * formatting? 
* removing overlap with another commit * Fix torchmetrics backwards compatibility issue (#2468) * add fix * fix tests * qwf * dsfg * add key * remove short * add map test * remove comment * filter warning * simplify wrapping * checkdown * fix torchmetrics * 300 * fix tests * remove metric * cleanup * bug fixes * fix lint * fix lint * fix test * lint * remove cuda * fix tests * fix ignore * fix loading * fix test * save ckpt --------- Co-authored-by: Mihir Patel <[email protected]> Co-authored-by: Daniel King <[email protected]> Co-authored-by: Your Name <[email protected]> * Adding some fixes to FSDP tests (#2495) * Adding some fixes to FSDP tests * Add filter warnings * fail count (#2496) * Remove PR curve metrics from backward compatibility test and skip torch 1.13 (#2497) * filter warning (#2500) * bump version (#2498) * Skip metrics in state dict (#2501) * skip metrics in state dict * fix unit tests * Add peak memory stats (#2504) * add peak memory stats * fix tests * fix sharded ckpt (#2505) * Bump gitpython from 3.1.31 to 3.1.34 (#2509) Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.31 to 3.1.34. - [Release notes](https://github.com/gitpython-developers/GitPython/releases) - [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES) - [Commits](gitpython-developers/GitPython@3.1.31...3.1.34) --- updated-dependencies: - dependency-name: gitpython dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Annotate `torch_prof_remote_file_name` as Optional (#2512) The `torch_prof_remote_file_name` argument of `Profiler` is passed as the `remote_file_name` argument of `TorchProfiler`, which supports passing `None` to disable uploading trace files. 
Prior to this commit, passing `None` to `Profiler` to do this whilst using a static type checker led to a type error. * fix: when there is no train_metrics, do not checkpoint (#2502) * Remove metric saving (#2514) * no metric save * fix docs * checkdown * fix tests * filter warning * move to device * fix device gpu * Update composer/core/state.py Co-authored-by: Daniel King <[email protected]> --------- Co-authored-by: Daniel King <[email protected]> * Fix daily tests by removing gpu marker (#2515) * Refactor mosaic_fsdp.py (#2506) * Refactor mosaic_fsdp.py * Format file * Rename monkey patch function * Fix import path * Format files * Fix version * fix pr (#2517) * Add custom sharding to ChunkShardingSpec (#2507) * Refactor mosaic_fsdp.py * Format file * Rename monkey patch function * Fix import path * Format files * Fix version * Fix import path * Monkey patch ChunkShardingSpec to dynamically detect sharding dim * Format file * Add non divisible functionality to ChunkShardingSpec * Format file * Format file * Update nightly docker image to torch nightly 09-03-23 (#2518) * Update pre-commit in setup.py (#2522) * Add FSDP custom wrap with torch 2.1 (#2460) * add torch2 * add code * tag more changes * Update composer/trainer/mosaic_fsdp.py Co-authored-by: Vitaliy Chiley <[email protected]> * monkeypatch init * raise pins * add print * more logs * change if statements * remove imports * remove imports * fix init * fix versioning * add hybrid shard * checkdown * revert hsdp * add peak memory stats * lint * imports * Update composer/trainer/mosaic_fsdp.py Co-authored-by: Daniel King <[email protected]> * fix wrap * fix gate * lint * test * change thresh * import typing * fix checks * nuke pyright * typo * Update composer/trainer/mosaic_fsdp.py Co-authored-by: Brian <[email protected]> * Update composer/trainer/mosaic_fsdp.py Co-authored-by: Brian <[email protected]> * Update composer/trainer/mosaic_fsdp_utils.py Co-authored-by: Brian <[email protected]> * resolve 
comments * add comments * add comments --------- Co-authored-by: Vitaliy Chiley <[email protected]> Co-authored-by: Daniel King <[email protected]> Co-authored-by: Brian <[email protected]> * Fix GCSObjectStore bug where hmac keys auth doesn't work (#2519) * prelim commit * add output logger * create eval output logger * change dist reduce fx * Bump gitpython from 3.1.34 to 3.1.35 (#2525) Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.34 to 3.1.35. - [Release notes](https://github.com/gitpython-developers/GitPython/releases) - [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES) - [Commits](gitpython-developers/GitPython@3.1.34...3.1.35) --- updated-dependencies: - dependency-name: gitpython dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump pytest from 7.4.0 to 7.4.2 (#2523) Bumps [pytest](https://github.com/pytest-dev/pytest) from 7.4.0 to 7.4.2. - [Release notes](https://github.com/pytest-dev/pytest/releases) - [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst) - [Commits](pytest-dev/pytest@7.4.0...7.4.2) --- updated-dependencies: - dependency-name: pytest dependency-type: direct:development update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Upgrade to mlflow version 2.5.0 (#2528) * disable cifar daily (#2527) * mosaicml logger robustness improvements (#2530) * Fix metrics keys sort in DecoupledAdamW for OptimizerMonitor FSDP metric agreggation (#2531) * Fix github actions for GCS integration testing (#2532) * fix github actions * make gpu test * change dist reduce fx * fix pyright * Fix GCS tests (#2535) * add PR tests * fix test * remove pr daily * remove pr daily * finish error logging cb * fix * add import to init * add import to init * add import to init * add file writing * add file writing * add file writing * add file writing * add file writing * move tensors to cpu * remove tensors * remove tensors * remove tensors * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * try debugging dist sync issue * nit * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * fix syncing of non tensor state * added gpu test * fix error * finish testing callback * fix all errors * test commit * roll back test commit * remove ranks * re-tesT * add custome gen kwargs and stopping on eos token * modify test * modify test * finish * finish * finish * finish * finish pr * implement early stop * add tesT * merge * fix * finish * finish * fix bug * finish * bug fix * add keys * add correcT * modify sync * diff split * fix typo * edit condition * broken wip * design demonstration commit * simplify pr * further simplify * wip * add comments * add other icl metrics * wip * change dict method, add more stuff to logging * fix typos, change some comments * 
decode tensors, fix wrong dict key * fix mc * 1 to 0 lol * wip linting * adjust to step logging * adjust logging names * add mflow, rm batch keys * add comments, check for dict in huggingface model update_metric * add user specified logging * move metric_name duplication to update_metric * wip fix testing * fix input shape error * rm init * rm eval_after_all * step=None * step=state.timestamp.batch.value * update name to include step * linting, wip on test * fix test * pyright wip * add non-batch warning * pyright * debug * rm this commit that wasn't the right branch * log at the end of training * rm silly wandb table logging * add run_name * add docstring * add debug logging * more logging * rm info logging * improve comments * Update composer/callbacks/eval_output_logging_callback.py Co-authored-by: Evan Racah <[email protected]> * rm logging bool * fix logging for schema tasks * fix schema / mc tasks * yapf * rm reshape * fix tests * cleanup test * pyright * pyright * docstring * pyright * update tests * rm attention mask requirement * Update composer/metrics/nlp.py Co-authored-by: Mihir Patel <[email protected]> * Update composer/metrics/nlp.py Co-authored-by: Mihir Patel <[email protected]> * rm todo * lint * lint * lint * more lint --------- Signed-off-by: Prithvi Kannan <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Jeremy Dohmann <[email protected]> Co-authored-by: Jeremy D <[email protected]> Co-authored-by: Charles Tang <[email protected]> Co-authored-by: Rishab Parthasarathy <[email protected]> Co-authored-by: Prithvi Kannan <[email protected]> Co-authored-by: Evan Racah <[email protected]> Co-authored-by: eracah <[email protected]> Co-authored-by: Irene Dea <[email protected]> Co-authored-by: Daniel King <[email protected]> Co-authored-by: nik-mosaic <[email protected]> Co-authored-by: bandish-shah <[email protected]> Co-authored-by: Bandish Shah <[email protected]> Co-authored-by: bcui19 <[email protected]> 
Co-authored-by: Mihir Patel <[email protected]> Co-authored-by: Your Name <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Scott Stevenson <[email protected]> Co-authored-by: furkanbiten <[email protected]> Co-authored-by: Brian <[email protected]> Co-authored-by: Vitaliy Chiley <[email protected]> Co-authored-by: Nicholas Garcia <[email protected]> Co-authored-by: Mikhail Kolesov <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Tessa Barton <[email protected]>
What does this PR do?
Log eval outputs after each batch using `logger.log_table`. This is an alternative design to logging at the end of eval, found here.

REQUIRES LLM-FOUNDRY BRANCH: mosaicml/llm-foundry#961
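As a rough illustration of the batch-level approach described above, the sketch below builds one table per eval batch and hands it to a `log_table` call. It is not the code from this PR: `StubLogger`, `log_eval_batch`, the column names, and the toy data are all hypothetical, and the `log_table(columns, rows, name, step)` signature on the stub is an assumption made only so the example runs without composer installed.

```python
from typing import Any, Dict, List, Optional


class StubLogger:
    """Hypothetical stand-in for a logger with a log_table method.

    It only records calls in memory so the example is self-contained.
    """

    def __init__(self) -> None:
        self.tables: List[Dict[str, Any]] = []

    def log_table(
        self,
        columns: List[str],
        rows: List[List[Any]],
        name: str = 'Table',
        step: Optional[int] = None,
    ) -> None:
        self.tables.append({'name': name, 'step': step, 'columns': columns, 'rows': rows})


def log_eval_batch(
    logger: StubLogger,
    benchmark: str,
    batch_idx: int,
    prompts: List[str],
    outputs: List[str],
    labels: List[str],
) -> List[List[Any]]:
    """Log one table for a single eval batch (hypothetical helper)."""
    columns = ['prompt', 'output', 'label', 'correct']
    # One row per example in the batch; 'correct' is a plain exact-match check.
    rows = [
        [prompt, output, label, output.strip() == label.strip()]
        for prompt, output, label in zip(prompts, outputs, labels)
    ]
    # Including the batch index in the table name keeps per-batch tables distinct.
    logger.log_table(columns, rows, name=f'{benchmark}_step_{batch_idx}', step=batch_idx)
    return rows


logger = StubLogger()
log_eval_batch(
    logger,
    benchmark='toy_qa',
    batch_idx=0,
    prompts=['2+2=', 'Capital of France?'],
    outputs=['4', 'Lyon'],
    labels=['4', 'Paris'],
)
```

Logging per batch means each table stays small and is emitted as soon as the batch finishes, at the cost of producing many tables per eval run instead of one at the end.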
Eval only run: wandb
run name: `test-batch-logging-kuIoME`